Red Wine Quality Exploration by Edmund Wong

This report explores a dataset containing quality scores and attributes for 1599 red wines. The dataset contains 12 variables: 11 are chemical properties of the wine and the twelfth is its quality score. (The first column of the raw file, ‘X’, is only a row index and can be dropped.)

Univariate Plots Section

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

For this red wine dataset, I will begin examining the distributions of each of these variables.

The fixed acidity distribution appears slightly positively skewed, peaking at around 7.5 g/dm^3.

Most red wines have a volatile acidity between 0.25 and 0.75 g/dm^3.

The citric acid distribution is positively skewed, peaking at zero citric acid.

Since the histogram of citric acid is right skewed, I applied a log transform to the data. R warns that 132 values were removed because they are non-finite. The minimum value of citric.acid is 0.0 and the log of zero is negative infinity, so we can deduce that 132 red wines were measured as having exactly zero citric acid.
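The reasoning about zeros and the log transform can be checked on a small sample. The sketch below uses only the first ten citric.acid values shown in the str() output above, and assumes a base-10 log (the report does not say which base was used; any base gives the same non-finite result for zero).

```r
# First ten citric.acid values, copied from the str() output above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# log10(0) evaluates to -Inf, which cannot be placed on a histogram axis.
# The base (10) is an assumption; log() or log2() behave the same way at zero.
log_citric <- log10(citric_acid)

# Count the non-finite results and compare with the number of exact zeros.
n_nonfinite <- sum(!is.finite(log_citric))
n_zero <- sum(citric_acid == 0)

print(n_nonfinite)
print(n_zero)
```

The two counts agree on this sample, which is the same argument that links the 132 removed values to 132 wines with zero citric acid.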

Most of the values of residual sugar lie between 1.5 and 3 g/dm^3. There are noticeable outliers above 10 g/dm^3.

The distribution of chlorides in red wine looks roughly normal, with values mainly between 0.04 and 0.13 g/dm^3. There are clearly noticeable outliers above 0.3 g/dm^3.

The free sulfur dioxide distribution is right skewed, with values mainly between 0 and 30 mg/dm^3.

A log transform was applied to the free sulfur dioxide distribution.

The total sulfur dioxide distribution is right skewed, with values mainly between 0 and 75 mg/dm^3. I wonder if there is a relationship between free sulfur dioxide, total sulfur dioxide, and sulphates. Would they have an effect on red wine quality?

A log transform was applied to the total sulfur dioxide distribution. What about the non-free (bound) sulfur dioxide of the red wine? Does it have any relationship with quality? I obtain non-free sulfur dioxide by subtracting the amount of free sulfur dioxide from total sulfur dioxide. I expect the data to be right skewed, so I also log-transform the non-free sulfur dioxide distribution.
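The derived variable can be sketched as follows, using the first ten rows shown in the str() output above. The names `bound_so2` and `log_bound` are illustrative, not taken from the original analysis, and the base-10 log is an assumption.

```r
# First ten values of each column, copied from the str() output above.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Non-free (bound) sulfur dioxide is the part of the total that is not free.
bound_so2 <- total_so2 - free_so2

# Log transform to compress the long right tail (base 10 is an assumption).
log_bound <- log10(bound_so2)

print(bound_so2)
```

A wine whose total equals its free sulfur dioxide would give a bound value of zero and a non-finite log, so it is worth checking for such rows before plotting the transformed variable.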

The density of red wine is very close to that of water. The bulk of the density distribution lies between 0.9930 and 1.0000 g/cm^3.

The red wine pH distribution looks roughly normal, with most values between 3 and 3.6. There may be a connection between pH and fixed acidity, volatile acidity, and citric acid.

Sulphates are roughly normally distributed, with the majority of values between 0.4 and 0.8 g/dm^3. There are a few outliers above 1.5 g/dm^3.

The bin with the greatest number of red wines is between 9.4% and 9.5% alcohol by volume. The distribution seems slightly right skewed.

All quality values are integers with most red wines being rated a 5 or a 6.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines in the dataset, described by 11 physicochemical properties (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) and 1 subjective variable (quality). All variables are continuous except quality, which takes integer values and is treated here as an ordered factor. My observations from the univariate analysis include:

+ The median volatile acidity is 0.52 g/dm^3.
+ Most red wines are acidic, mainly having a pH between 3 and 3.5.
+ All quality scores are integers between 3 and 8, with most scores being 5 or 6. The median score is 6 and the mean is 5.636.
+ Most red wines have densities between 0.99 and 1 g/cm^3, which is slightly less than or almost equal to the density of water.
+ Most red wines have an alcohol content between 9% and 12%.
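Since str() reports quality as an integer column, treating it as an ordered factor needs an explicit conversion. A minimal sketch, using the first ten quality values from the str() output and the 3 to 8 range from the summary; the variable name `quality_f` is illustrative only.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Convert to an ordered factor; levels 3..8 follow the min and max in summary().
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Tabulate the scores; levels with no wines in this sample show a count of 0.
print(table(quality_f))

# Ordered factors support comparisons such as "rated above 5".
n_above_5 <- sum(quality_f > "5")
print(n_above_5)
```

Declaring all six levels up front keeps empty categories visible in tables and on boxplot axes, even when a subset of the data contains no wines at that score.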

What is/are the main feature(s) of interest in your dataset?

The main features of interest in my dataset are alcohol, volatile acidity, sulphates, and quality. I am interested in trying to find out which attributes of red wine are useful in predicting the quality of red wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Density, pH, and chlorides are other features in the dataset that I think will help support my investigation into predicting the quality of red wine.

Did you create any new variables from existing variables in the dataset?

I created non-free sulfur dioxide, the bound form of sulfur dioxide. It is calculated by subtracting free sulfur dioxide from total sulfur dioxide. This new variable could give more insight into how sulfur dioxide relates to the quality of red wine.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distributions of several features (citric acid and total, free, and non-free sulfur dioxide) were right skewed. I log-transformed these variables to visualize and understand their distributions better.

Bivariate Plots Section

I used the pairs.panels function from the psych package to compute the Pearson correlation between each pair of variables in the dataset. Here I will generate a scatterplot of alcohol against quality for all red wines, as those are my primary features of interest.

The scatterplot above shows a subtle trend: as alcohol content increases, quality increases as well. This may be more evident in a boxplot.

I used a boxplot to visualize the same data. From a quality of 5 upward, the increasing trend in alcohol is clear. There seems to be an overall positive correlation between alcohol and quality. The correlation coefficient is shown below.

## [1] 0.4761663
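Coefficients like the one above are Pearson correlations, which base R computes with cor(). The sketch below uses only the first ten rows from the str() output, so its result will differ from the full-data value of 0.476; it also recomputes the coefficient from the definition as a sanity check. The variable names are illustrative.

```r
# First ten values of each column, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Pearson correlation via the built-in function.
r <- cor(alcohol, quality, method = "pearson")

# The same quantity from the definition:
# sum of cross-deviations over the root of the product of squared deviations.
dx <- alcohol - mean(alcohol)
dy <- quality - mean(quality)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

print(r)
print(r_manual)
```

Because quality is an ordinal score, a rank-based alternative such as cor(..., method = "spearman") may also be worth reporting alongside the Pearson value.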

Let’s see how alcohol relates to other variables, such as density.

Here, there is a clear negative trend between alcohol and density: the higher the alcohol content, the lower the density. This is expected because alcohol is less dense than water, so density should fall as alcohol content rises.

## [1] -0.4961798

How about looking at my other primary feature of interest, volatile acidity, and its relationship with quality?

There seems to be a clear negative trend between volatile acidity and quality.

## [1] -0.3905578

The decrease levels off at quality scores of 7 and above. As quality increases, the variance of volatile acidity generally becomes smaller. Let’s investigate further by looking at how pH relates to volatile acidity.

I expected volatile acidity to be strongly correlated with pH, but the plot does not show this. There seems to be only a slight positive correlation between pH and volatile acidity.

## [1] 0.2349373

Let’s move on to see how sulphates relate to wine quality.

There is a slight positive correlation between sulphates and the quality of red wine.

## [1] 0.2513971

The median of sulphates increases with each step increase in quality. The variance of sulphates peaks at qualities of 6 and 7. I will now check how sulphates relate to other variables, such as chlorides.
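The per-quality medians that a boxplot displays can also be computed directly. A minimal sketch on the first ten rows from the str() output; this sample covers only scores 5, 6, and 7, so it illustrates the calculation rather than reproducing the trend seen in the full data.

```r
# First ten values of each column, copied from the str() output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Median and variance of sulphates within each quality score.
med_by_quality <- tapply(sulphates, quality, median)
var_by_quality <- tapply(sulphates, quality, var)

print(med_by_quality)
print(var_by_quality)
```

A group with a single wine (score 6 in this sample) has an undefined variance, which R reports as NA; on the full dataset every score from 3 to 8 should be checked for group size before comparing spreads.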

I notice a slight positive trend here but overplotting is evident. I will add some adjustments to improve this plot.

After adding jitter and changing the plot limits, I still notice the same positive trend. There is a mild positive correlation between sulphates and chlorides.

## [1] 0.3712605

After exploring all of my main features of interest, how about looking at the relationship between two of the secondary features of interest: pH and density?

We see a mild negative trend between pH and density. As pH increases, the density of red wine generally decreases.

## [1] -0.3416993

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The main features of interest (alcohol, volatile acidity, and sulphates) in this red wine dataset all show some correlation with the quality of the red wine. There is a positive relationship between alcohol and quality: as alcohol increases, quality generally increases. Alcohol rises steadily above a quality score of 5, and its variance is smallest at a quality score of 5. Volatile acidity has a negative relationship with quality: it generally decreases with each step increase in quality, and its variance also tends to decrease as quality increases. There is a slight positive trend between sulphates and quality. The median of sulphates increases with each step increase in quality; the greatest median is at a quality score of 8 and the smallest at a quality score of 3.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There was a small negative trend between pH and density. As pH increases, density seems to decrease.

What was the strongest relationship you found?

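The strongest relationship with quality that I found was with alcohol (r = 0.48), which is stronger than the correlation of quality with volatile acidity or sulphates. Among all the pairs examined, alcohol and density had the largest correlation in magnitude (r = -0.50).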

Multivariate Plots Section

I will review each of the three main variables of interest together with a secondary variable of interest and their relationship with the quality of red wine.

I see here the positive trend between alcohol and quality (r = 0.48). Holding quality constant, wines with lower density generally have higher alcohol content (r = -0.5).

A clear negative trend between volatile acidity and quality can be seen here (r = -0.39). It is difficult to see any relationship between pH and volatile acidity in this plot (r = 0.23).

There seems to be a slight positive correlation between sulphates and quality (r = 0.25). Holding quality constant, there is a small positive trend between chlorides and sulphates (r = 0.37).

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There was a positive correlation between alcohol and quality. Holding quality constant, there is a negative relationship between density and alcohol. A negative correlation exists between volatile acidity and quality, but little trend is observable between pH and volatile acidity. A small trend can be observed between sulphates and quality. Holding quality constant, a small trend can also be seen between chlorides and sulphates.

Were there any interesting or surprising interactions between features?

It was interesting to notice the negative correlation between volatile acidity and the quality of red wine in the plot (r = -0.39).


Final Plots and Summary

Plot One

Description One

The plot shows the relationships between density, alcohol, and quality of red wine. This plot was chosen because it contains two relatively strong relationships and because alcohol was one of my main features of interest alongside quality. There is a negative trend between density and alcohol (r = -0.5). Quality is generally higher as alcohol percentage increases (r = 0.48).

Plot Two

Description Two

This boxplot highlights the decrease in volatile acidity as the quality of red wine increases (r = -0.39). The variance also seems to decrease as quality increases. This plot was chosen because volatile acidity was one of the few variables whose median changed in a consistent direction as quality changed.

Plot Three

Description Three

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

This plot shows the sulphate distribution of all the red wines. Most of the red wines have between 0.5 and 0.75 g/dm^3 of sulphates. The distribution is slightly right skewed, with a few outliers above 1.5 g/dm^3. The plot was chosen because sulphates was one of my main features of interest from the start.


Reflection

The red wine dataset contains data for 1599 red wines with 12 variables. All but one of these variables are physical properties of the wine. The other variable is a subjective quality score given by a reviewer. I began the analysis by looking at the distributions of each of the features.

The pairs.panels function of the psych package gave an overview of the relationships between all of the features. This overview helped me decide which features would be useful for my analysis. It was interesting to see that alcohol percentage had a positive correlation with the quality of the red wine. Low-density wines generally had a higher alcohol percentage, but it was surprising to see that there was not much of a relationship between density and quality.

Using the ggplot2 library in R was not very straightforward in the beginning, but as the analysis progressed it became more intuitive and provided a lot of flexibility in how I explored the dataset. Building histograms, scatterplots, and boxplots with ggplot2 was a valuable learning experience that equipped me with useful data analysis skills.

This analysis could be enriched if more red wine data were available. This dataset is small compared to the white wine dataset (which has almost 5000 white wines). Perhaps with enough data, machine learning could be applied to predict quality based on certain features of red wine.